Words'n'numbers
Tokenizing strings of text. Extracting arrays of words and optionally numbers and emojis / emoticons from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions. Part of document processing for search-index and nowsearch.xyz.
Inspired by extractwords
Initiating
Node.js
const wnn = require('words-n-numbers')
Browser
<script src="wnn.js"></script>
<script>
</script>
Use
The default regex should catch every Unicode letter in every language.
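To illustrate why a plain [a-z] pattern falls short, here is a minimal sketch comparing it with a Unicode-aware pattern built from the `\p{L}` property escape. Both patterns and all variable names are assumptions made for illustration; the library's actual default regex may differ.

```javascript
// Illustrative patterns only; the library's real default regex may differ.
const asciiOnly = /[a-z]+/gi // ASCII letters only
const anyLetter = /\p{L}+/gu // any Unicode letter (requires the `u` flag)

const sample = 'Smørbrød in 大阪'

// With the `g` flag, `match` returns every match, or null when there are none.
const asciiMatches = sample.match(asciiOnly) || []
const unicodeMatches = sample.match(anyLetter) || []

console.log(asciiMatches) // [ 'Sm', 'rbr', 'd', 'in' ]
console.log(unicodeMatches) // [ 'Smørbrød', 'in', '大阪' ]
```

The ASCII pattern splits words at every non-ASCII letter, while the Unicode pattern keeps them whole.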
Only words
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords)
Only words, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { toLowercase: true })
Predefined regex for words and numbers, converted to lowercase
let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbers, toLowercase: true })
Predefined regex for words and emoticons, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsEmojis, toLowercase: true })
Predefined regex for numbers and emoticons
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.numbersEmojis, toLowercase: true })
Predefined regex for words, numbers and emoticons, converted to lowercase
let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbersEmojis, toLowercase: true })
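As a rough idea of how a combined pattern for words, numbers and emojis can be assembled, the sketch below joins three alternatives into one regular expression. The building blocks (`letters`, `digits`, `pictographs`) are hypothetical and are not taken from the library; its predefined regexes may be defined quite differently.

```javascript
// Hypothetical building blocks; the library's predefined regexes may differ.
const letters = '\\p{L}+' // runs of letters in any script
const digits = '\\p{N}+' // runs of numeric characters
const pictographs = '\\p{Extended_Pictographic}' // one emoji-like symbol at a time

// Join the alternatives into one global, Unicode-aware pattern.
const combined = new RegExp([letters, digits, pictographs].join('|'), 'gu')

const sentence = 'A ticket to 大阪 costs ¥2000 😢'
const tokens = (sentence.match(combined) || []).map(t => t.toLowerCase())

console.log(tokens)
// [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '😢' ]
```

The currency sign is dropped because it is neither a letter, a number nor a pictograph. `\p{Extended_Pictographic}` is used here instead of `\p{Emoji}` because the latter also matches plain digits.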
Predefined regex for #tags
let stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.tags, toLowercase: true })
Predefined regex for @usernames
let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, @alice and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.usernames, toLowercase: true })
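One plausible way to match #tags and @usernames only when the marker starts a token is a negative lookbehind. The sketch below assumes that a marker directly following a letter or digit (as in an email address) should be ignored. The patterns `tagPattern` and `usernamePattern` are hypothetical and not taken from the library's source.

```javascript
// Hypothetical patterns; not taken from the library's source.
// The lookbehind rejects a marker that directly follows a letter or digit,
// so 'two#tickets' and 'bob@bob.com' are skipped.
const tagPattern = /(?<![\p{L}\p{N}])#[\p{L}\p{N}]+/gu
const usernamePattern = /(?<![\p{L}\p{N}])@[\p{L}\p{N}]+/gu

const post = 'A #49ticket or two#tickets, says bob@bob.com to @alice'

const foundTags = post.match(tagPattern) || []
const foundUsernames = post.match(usernamePattern) || []

console.log(foundTags) // [ '#49ticket' ]
console.log(foundUsernames) // [ '@alice' ]
```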
Custom regex
let stringOfWords = 'This happens at 5 o\'clock !!!'
wnn.extract(stringOfWords, { regex: '[a-z\'0-9]+' })
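For comparison, this is what the same custom pattern matches with plain JavaScript. The sketch assumes that a pattern string is compiled with the global, case-insensitive and Unicode flags. How the library actually compiles a string pattern is not stated here, so treat the flags as an assumption.

```javascript
// Assumption: a pattern supplied as a string is compiled with the global,
// case-insensitive and Unicode flags. The library may handle this differently.
const patternSource = "[a-z'0-9]+"
const customPattern = new RegExp(patternSource, 'giu')

const phrase = "This happens at 5 o'clock !!!"
const customTokens = phrase.match(customPattern) || []

console.log(customTokens)
// [ 'This', 'happens', 'at', '5', "o'clock" ]
```

Including the apostrophe in the character class keeps "o'clock" as a single token. Without the case-insensitive flag, the capital "T" in "This" would be dropped.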
API
Returns an array of words and optionally numbers.
wnn.extract(stringOfText, <options-object>)
Options object
{
regex: '[custom or predefined regex]',
toLowercase: [true / false]
}
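To make the two options concrete, here is a minimal stand-in that mimics the documented call shape. `extractSketch` and its letters-only default pattern are hypothetical, written only to show how `regex` and `toLowercase` could interact. It is not the library's implementation.

```javascript
// Hypothetical stand-in for illustration; not the library's implementation.
function extractSketch (text, options = {}) {
  // Assumed default: a letters-only pattern when no regex is supplied.
  const { regex = /\p{L}+/gu, toLowercase = false } = options
  // Accept either a RegExp (expected to carry the `g` flag) or a pattern string.
  const pattern = typeof regex === 'string' ? new RegExp(regex, 'giu') : regex
  const matches = text.match(pattern) || []
  return toLowercase ? matches.map(word => word.toLowerCase()) : matches
}

const wordsOnly = extractSketch('A 1000000 dollars baby!')
const wordsAndNumbers = extractSketch('A 1000000 dollars baby!', {
  regex: /[\p{L}\p{N}]+/gu,
  toLowercase: true
})

console.log(wordsOnly) // [ 'A', 'dollars', 'baby' ]
console.log(wordsAndNumbers) // [ 'a', '1000000', 'dollars', 'baby' ]
```

Returning an empty array when nothing matches spares callers from checking for null.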
Predefined regexes
wnn.words
wnn.numbers
wnn.emojis
wnn.wordsNumbers
wnn.wordsEmojis
wnn.numbersEmojis
wnn.wordsNumbersEmojis
wnn.tags
wnn.usernames
Languages supported
Supports most languages supported by stopword, and others too. Some languages, such as Japanese and Simplified Chinese, need to be tokenized first. Tokenizers may be added at a later stage.
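Until tokenizers are available, one possible pre-tokenization step for such languages is the built-in `Intl.Segmenter`. The sketch below assumes a runtime that ships it (recent Node.js versions and browsers). It is not something the library does itself, and the exact split depends on the runtime's dictionary data.

```javascript
// Assumption: the runtime provides Intl.Segmenter.
// This is one possible pre-tokenization step, not part of the library.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' })

const japanese = '大阪に行きます'

// Keep only word-like segments; punctuation and whitespace are dropped.
const segments = Array.from(segmenter.segment(japanese))
  .filter(part => part.isWordLike)
  .map(part => part.segment)

// The exact split depends on the runtime's dictionary data.
console.log(segments)
```

The resulting segments could then be joined with spaces before being passed on for extraction.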
PRs welcome
PRs and issues are more than welcome =)